Open-Set Language Identification
Abstract
We present the first open-set language identification experiments using one-class classification models. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine for each language using only a monolingual corpus of that language. Each model is evaluated on a test set containing data from all 10 languages, and we achieve an average F-score of 0.99, demonstrating the effectiveness of this approach for open-set language identification.
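The two ideas in the abstract, hashing text into a fixed-size feature space and fitting a model on a single language only, can be illustrated with a small standard-library sketch. It is not the authors' implementation. The abstract does not specify the features or the hash function, so character bigrams, `zlib.crc32`, the bucket count, and the threshold below are all assumptions. In place of a One-Class Support Vector Machine, the sketch uses a much simpler centroid-and-threshold detector (the hypothetical `CentroidOneClass`), which only mimics the accept/reject behaviour of a one-class model.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def hash_features(text, n=2, n_buckets=N_BUCKETS):
    """Hash character n-grams into a sparse, L2-normalised count vector."""
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # The hashing trick: no vocabulary is stored, so unseen scripts
        # still map into the same fixed-size space.
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        counts[bucket] = counts.get(bucket, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else counts


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


class CentroidOneClass:
    """Simplified stand-in for a one-class model (NOT a One-Class SVM)."""

    def __init__(self, threshold=0.3):  # assumed, untuned threshold
        self.threshold = threshold
        self.centroid = {}

    def fit(self, texts):
        # Sum the unit vectors of the monolingual corpus, then renormalise.
        total = {}
        for text in texts:
            for k, v in hash_features(text).items():
                total[k] = total.get(k, 0.0) + v
        norm = math.sqrt(sum(v * v for v in total.values()))
        self.centroid = {k: v / norm for k, v in total.items()} if norm else total
        return self

    def score(self, text):
        return cosine(hash_features(text), self.centroid)

    def predict(self, text):
        # +1 = accepted as the target language, -1 = rejected as out-of-set.
        return 1 if self.score(text) >= self.threshold else -1


english_train = [
    "the quick brown fox jumps over the lazy dog",
    "language identification is the task of detecting the language of a text",
    "we train one model for each language using only text in that language",
]
model = CentroidOneClass().fit(english_train)

for sample in [
    "the model detects the language of the text",
    "быстрая коричневая лиса прыгает через ленивую собаку",
]:
    print(model.predict(sample), round(model.score(sample), 3), sample)
```

Because each model sees only its own language during training, adding a new language means fitting one more model rather than retraining a multi-class classifier. The +1/-1 labels follow the convention used by scikit-learn's `OneClassSVM.predict`, which would be the natural replacement for `CentroidOneClass` in a closer reproduction.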
Similar resources
Out-of-Set i-Vector Selection for Open-set Language Identification
Current language identification (LID) systems are based on an i-vector classifier followed by a multi-class recognition back-end. Identification accuracy degrades considerably when LID systems face open-set data. In this study, we propose an approach to the problem of out-of-set (OOS) data detection in the context of open-set language identification. In our approach, each unlabeled i-vector in t...
Comparison of Spectral Methods for Spoken Language Identification
Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
Analysis of multitarget detection for speaker and language recognition
The general multitarget detection (open-set identification) task is the intersection of the more familiar tasks of closed-set identification and open-set verification/detection. In the multitarget detection task, an input of unknown class is processed by a bank of parallel detectors and a decision is required as to whether the input is from among the target classes and, if so, which one. In this...
Finding and Identifying Text in 900+ Languages
This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, a...
Leveraging the open source ispell codebase for minority language analysis
The ispell family of spellcheckers is perhaps the single most widely ported and deployed open-source language tool. Here we describe how the SzóSzablya ‘WordSword’ project leverages ispell’s Hungarian descendant, HunSpell, to create a whole set of related tools that tackle a wide range of low-level NLP-related tasks such as character set normalization, language detection, spellchecking, stemmin...
Journal: CoRR
Volume: abs/1707.04817
Publication date: 2017